Compact tree node representation of an XML document

ABSTRACT

Methods and systems for creating a compact tree node representation of an XML document. One implementation commences by allocating memory for storing an XML tree index data structure, then allocating a separate portion of memory to store a hash table. The XML document is then traversed, and each traversed node is processed as follows: (a) when the traversed node is an element node, the element node is added to the XML tree index data structure; (b) when the traversed node is a text node, a text node index is populated into the XML tree index data structure and the text node values are copied to the hash table; and (c) when the traversed node is an attribute node, an attribute node index is populated into the XML tree index data structure. Such a structure supports fast index-based tree restructuring, and permits very large XML documents to be accessed within tight memory size constraints.

RELATED APPLICATIONS

The present application is a divisional application of U.S. application Ser. No. 13/459,901, which was filed on Apr. 30, 2012 and entitled “COMPACT TREE NODE REPRESENTATION OF AN XML DOCUMENT” which claimed the benefit of priority to U.S. Provisional Patent Application Ser. No. 61/542,181, filed Oct. 1, 2011, entitled “DATA STRUCTURE REPRESENTATION, MEANING AND PASSING” which are all hereby incorporated by reference in their entirety.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD

The disclosure relates to the field of managing XML documents and more particularly to techniques for creating a compact tree node representation of an XML document.

BACKGROUND

Embodiments of the present disclosure are directed to an improved approach for creating a compact tree node representation of an XML document.

In various use cases, legacy representation of data structures within enterprise data processing applications is inconvenient, and in some cases unwieldy. For example, in storing and accessing an XML document, it is known that when using (legacy) DOM processing implementations to access an XML document, the entire DOM data structure must be stored in memory. For some data processing applications, especially those involving kernel processing, the DOM processing runs out of memory and crashes when generating a large volume of data (e.g., a large number of invoices). One legacy approach has been to split the processing into several jobs, each with a smaller amount of data (e.g., a smaller number of invoices). However, this technique is often inconvenient to employ, and moreover this technique does not work well in situations where it is a priori unknown how to split the processing into smaller amounts of data.

The aforementioned legacy technologies do not have the capabilities to create a compact tree node representation of an XML document that can be accessed portion by portion without requiring the entire XML document to reside in memory. Further, what is needed is a new method that uses significantly less memory compared to DOM, even when loading a complete XML document into memory. Therefore, there is a need for an improved approach for creating and using compact tree node representations of an XML document.

SUMMARY

The present disclosure provides an improved method, system, and computer program product suited to address the aforementioned issues with legacy approaches. More specifically, the present disclosure provides a detailed description of techniques used in methods, systems, and computer program products for creating a compact tree node representation of an XML document.

Disclosed herein are methods, systems and a computer program product for creating a compact tree node representation of an XML document. The disclosed techniques permit a compact, index-based representation of the nodes of a very large XML document to be loaded into memory, while keeping the contents of the XML document nodes in one or more separate data structures that can be accessed as needed. In some cases the entire XML document (in its compact representation) is loaded into memory and, by manipulating only indexes, the XML tree can be restructured. In situations where the contents of the XML document nodes are needed for processing (e.g., the aforementioned restructuring), the contents can be accessed using paging memory such that the contents of the XML document that is being processed can be loaded into memory, and the contents of the XML document that have been processed (or otherwise not needed in memory) can be paged out to external storage.

One implementation commences by allocating memory for storing an XML tree index data structure, allocating another separate portion of memory to store one or more hash tables, then traversing an XML document to process the traversed nodes as follows: (a) when the traversed node is an element node, then adding the element node to the XML tree index data structure; (b) when the traversed node is a text node, then populating a text node index into the XML tree index data structure and copying the text node values to a hash table; and (c) when the traversed node is an attribute node, then populating an attribute node index into the XML tree index data structure and copying the attribute node values and attribute node name to a hash table. Noncontiguous memory blocks can be used to hold the contents of the hash tables, and the memory blocks can be paged into memory or paged out of memory as needed.

Further details of aspects, objectives, and advantages of the disclosure are described below in the detailed description, drawings, and claims. Both the foregoing general description of the background and the following detailed description are exemplary and explanatory, and are not intended to be limiting as to the scope of the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a system for creating a compact tree node representation of an XML document, according to some embodiments.

FIG. 1B is a flowchart for constructing a compact tree node representation of an XML document, according to some embodiments.

FIG. 1C is a flowchart for generating a report based on a compact tree node representation of an XML document, according to some embodiments.

FIG. 2A depicts an enterprise software application within a system using a DOM XML tree model class for processing an XML document, according to some embodiments.

FIG. 2B depicts a transformation into an index for implementing a compact tree node representation of an XML document, according to some embodiments.

FIG. 2C depicts an enterprise software application in a system for implementing a compact tree node representation of an XML document, according to some embodiments.

FIG. 3A is a diagram of a tree node representation of an XML document, according to some embodiments.

FIG. 3B shows various components of XML tree index structures for implementing compact tree node representation of an XML document, according to some embodiments.

FIG. 4A is a schematic of a component of an XML tree index structure, according to some embodiments.

FIG. 4B is an array-oriented schematic of an XML tree index structure used in a compact tree node representation of an XML document, according to some embodiments.

FIG. 5 is a flowchart for using a system having a compact tree node representation of an XML document, according to some embodiments.

FIG. 6 depicts a block diagram of a system to perform certain functions of a computer system, according to some embodiments.

FIG. 7 depicts a block diagram of an instance of a computer system suitable for implementing an embodiment of the present disclosure.

DETAILED DESCRIPTION

Some embodiments of the present disclosure are directed to an improved approach for implementing compact tree node representation of an XML document. More particularly, disclosed herein are environments, methods, and systems for creating and using compact tree node representations of an XML document.

Overview

For various use cases of data structure representation, meaning and passing within enterprise data processing applications, legacy representation of the data is inconvenient, and in some cases unwieldy. For example, in DOM access, an XML data structure must be stored in memory (i.e., typically, DOM XML processing is used only when the XML data structure can be in memory). For some data processing applications involving kernel processing, the DOM processing runs out of memory and crashes when generating a large volume of data (e.g., a large number of invoices). One legacy approach has been to split the processing into several jobs, each with a smaller amount of data (e.g., a smaller number of invoices). However, the job-splitting technique is often inconvenient to employ, and moreover this technique does not work well in situations where it is a priori unknown how to split the jobs in order to use smaller amounts of data in memory.

One feature of the compact tree node representation of an XML document in accordance with the disclosed embodiments is that the compact tree nodes can represent a huge XML tree, i.e., can represent a large number of XML relationships in memory as compared to using DOM processing. The compact tree node representation as disclosed herein addresses the following:

-   The compact tree node representation consumes much less memory as compared to the aforementioned legacy DOM implementations;
-   The aforementioned legacy DOM implementations require the entire XML tree to be read into memory at one time, thus limiting the practical size of an XML tree to be processed;
-   The compact tree node representation can take advantage of discrete memory allocations and does not require contiguous memory allocations. Implementations of the compact tree node representation allocate memory in memory blocks (e.g., in discrete allocations of 64K byte arrays). If the XML tree is larger than available memory, then using the herein disclosed compact tree node representation, memory blocks can be easily paged to and from external storage (such as a disk drive).
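
By way of illustration only, the paging of discrete memory blocks to and from external storage might be sketched in Python as follows. The class name `PagedBlocks`, the file naming scheme, and the `demo` helper are hypothetical and do not appear in the disclosure; only the 64K-byte block size is taken from the example above.

```python
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # 64K bytes per block, as in the example above


class PagedBlocks:
    """Hypothetical store of fixed-size blocks that can be paged to disk."""

    def __init__(self, directory):
        self.directory = directory
        self.blocks = []  # each entry is a bytearray, or None when paged out

    def allocate(self):
        # Each block is a separate allocation; no contiguous region is needed.
        self.blocks.append(bytearray(BLOCK_SIZE))
        return len(self.blocks) - 1

    def _path(self, block_no):
        # Assumed naming scheme for the on-disk copy of a block.
        return os.path.join(self.directory, f"block_{block_no}.bin")

    def page_out(self, block_no):
        # Write the block to external storage and release the in-memory copy.
        with open(self._path(block_no), "wb") as f:
            f.write(self.blocks[block_no])
        self.blocks[block_no] = None

    def get(self, block_no):
        # Page the block back in on demand if it is not resident.
        if self.blocks[block_no] is None:
            with open(self._path(block_no), "rb") as f:
                self.blocks[block_no] = bytearray(f.read())
        return self.blocks[block_no]


def demo():
    with tempfile.TemporaryDirectory() as d:
        store = PagedBlocks(d)
        n = store.allocate()
        store.get(n)[0:5] = b"hello"
        store.page_out(n)
        resident_after_page_out = store.blocks[n] is not None
        data = bytes(store.get(n)[0:5])
        return resident_after_page_out, data
```

In this sketch a paged-out block occupies no main memory until it is next requested, at which point it is read back from the assumed backing file.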

DESCRIPTIONS OF EXEMPLARY EMBODIMENTS

FIG. 1A is a system 1A00 for creating a compact tree node representation of an XML document. As an option, the present system 1A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the system 1A00 or any aspect therein may be implemented in any desired environment.

FIG. 1A shows an XML tree structure 100 transformed into compact XML tree node data structures 170. The virtual address space 130 comprises an in-memory instance of a compact XML tree node data structure (e.g., compact XML tree node data structure 170 ₁), only a portion of which compact XML tree node data structures 170 needs to be in memory at any moment in time; other portions can be paged out (e.g., see compact XML tree node data structure 170 ₂).

In some embodiments an XML tree structure 100 comprises various types of nodes (e.g., element nodes, attribute nodes, and text nodes). Each node in the XML tree structure can be transformed into data values within the compact XML tree node data structures 170. For example, each node in the XML tree structure can be transformed so as to be represented by an address or pointer (e.g., an index value, a hash value) that can refer to a further structure (e.g., a memory block 160, a hash table 140, etc.). In some cases, a given node in the compact XML tree node data structure 170 can be (at least partially) represented by an index into a memory block, or by a pointer in the form of a hash value, which hash value in turn refers to a hash bin (e.g., hash bin 141 _(P), etc.) in a hash table 140. It is possible that each type of node in the XML tree can be represented by a different structure, which structure is defined based on the characteristics of the node type to be stored. Some examples are discussed below.

One possible flow in accordance with the embodiment of FIG. 1A begins with an XML document being stored in a persistent storage device (e.g., a disk drive, etc.). Such an XML document might have been written to the persistent storage device 102 ₁ by a report generator or other enterprise software application. In some cases, such an XML document can become voluminous, and in some cases much larger than the main memory of a computer, thus precipitating the need for the herein-disclosed techniques.

Continuing discussion of the system 1A00, an XML tree structure 100 can be read by a compact tree node constructor 120, and the compact tree node constructor can output a compact tree node representation as depicted in the form of an XML tree index 150 and a hash table 140 (as shown within the virtual address space 130).

While constructing the XML tree index 150, the compact tree node constructor 120 can populate the XML tree index 150 with the minimum amounts of data needed to uniquely identify a particular node. In some cases, the contents (e.g., text, attributes, etc.) of the node can be stored in other structures. For example, any given memory block (e.g., memory block 160 ₁, memory block 160 ₂, memory block 160 ₃, memory block 160 ₄, etc.) included in an XML tree index 150 might be populated with only element nodes, or only text nodes, or only attribute nodes, or only other types of nodes as may be present.

As can be understood, using such an organization, the XML data tree structure can represent a large XML tree, yet requires only a minimal portion of the XML data tree structure to be present in main memory at any moment in time. In exemplary embodiments, as the compact tree node constructor processes an XML tree, portions of the XML tree index 150 and portions of the hash table 140 can be written to a persistent storage device 102 ₂ in the form of a compact XML tree node data structure 170.

The embodiment of FIG. 1A supports the range of XML node types, namely element nodes, attribute nodes, and text nodes. As shown, an element node is designated as “E<n>” where <n> is a numeric value, a text node is designated as “T<n>” where <n> is a numeric value, and an attribute node is designated as “A<n>” where <n> is a numeric value. Additional types can be readily added without departing from the scope of the disclosure.

In some situations, a tree node may have multiple child nodes, and more particularly, a tree node may have multiple child text nodes. One possible approach for compact storage is to store the multiple text nodes in the same hash bin, and use an offset value to refer to a particular offset location within the given hash bin. As shown, even though the text node T1 and the text node T2 both refer to the same hash bin 141 _(P), each text node T1 or T2 can be individually addressed using the combination of the hash bin address and a corresponding offset value.
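
By way of illustration only, the addressing of several text values within a single hash bin might be sketched in Python as follows. The `HashBin` class and the 2-byte length prefix used to delimit each stored string are assumptions made for this sketch; the disclosure specifies only that a hash bin address is combined with an offset.

```python
class HashBin:
    """Hypothetical hash bin: one byte buffer holding several text values."""

    def __init__(self):
        self.data = bytearray()

    def append(self, text):
        # Store the text at the end of the bin and return its byte offset.
        offset = len(self.data)
        encoded = text.encode("utf-8")
        # Assumption: a 2-byte big-endian length prefix delimits each value.
        self.data += len(encoded).to_bytes(2, "big") + encoded
        return offset

    def read(self, offset):
        # Recover the text value that starts at the given offset.
        length = int.from_bytes(self.data[offset:offset + 2], "big")
        return self.data[offset + 2:offset + 2 + length].decode("utf-8")


bin_p = HashBin()
# Both text nodes refer to the same bin "P" but carry different offsets.
t1 = ("P", bin_p.append("first text"))
t2 = ("P", bin_p.append("second text"))
```

Here `t1` and `t2` share the bin identifier and differ only in the offset, so `bin_p.read(t1[1])` and `bin_p.read(t2[1])` return the two distinct values.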

FIG. 1B is a flowchart 1B00 for constructing a compact tree node representation of an XML document such as is given in FIG. 1A. As an option, the present flowchart 1B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the flowchart 1B00 or any aspect therein may be implemented in any desired environment.

The flow of flowchart 1B00 commences by accessing an XML document from a persistent storage device such as persistent storage device 102 ₁ (see operation 1B10). In some cases, the accessed XML document is sufficiently small that it can be accessed via a standard implementation of a document object model (DOM) and held in its entirety in memory (see FIG. 2A). In other cases, the accessed XML document is sufficiently large that it can advantageously be accessed using a streaming technique (e.g., SAX) in order to traverse the XML document (see operation 1B20). Then, an XML tree index can be constructed node-by-node as the XML document is traversed, and during the course of node-by-node construction, the XML tree index is populated with (at least) the data needed to uniquely identify a particular node. In exemplary cases, a node has child nodes, and the index(es) of the child nodes can be back-annotated into the parent node (see operation 1B30). Further, a node can comprise data. For example, a text node can comprise any amount of text, and an attribute node can comprise any extent of attributes and attribute values, etc. The text of a text node, and/or the attributes of an attribute node, etc. can be stored in a hash table. Accordingly, the flow of flowchart 1B00 allocates a hash table, and stores text portions of the XML document into hash bins within the hash table (see operation 1B40). At any point in time, some (or all) of the index and/or some (or all) of the hash table can be written to persistent storage (see operation 1B50) and construction can continue.
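
By way of illustration only, a streaming construction in the manner of operations 1B20 through 1B40 might be sketched in Python using the standard `xml.sax` module. The class `CompactTreeBuilder`, the row layout, and the use of a Python dictionary as a stand-in for the hash table are assumptions of this sketch and are not the disclosed implementation.

```python
import xml.sax


class CompactTreeBuilder(xml.sax.ContentHandler):
    """Hypothetical builder: one index row per node, strings stored once."""

    def __init__(self):
        super().__init__()
        self.strings = []        # stored names and values (hash table stand-in)
        self.string_ids = {}     # value -> position in self.strings
        self.nodes = []          # rows: [kind, name_id, value_id, next_sibling, first_child]
        self.open_elements = []  # stack of indexes of currently open elements
        self.last_child = {}     # element index -> index of its latest child

    def _intern(self, s):
        if s not in self.string_ids:
            self.string_ids[s] = len(self.strings)
            self.strings.append(s)
        return self.string_ids[s]

    def _add(self, kind, name_id, value_id):
        index = len(self.nodes)
        self.nodes.append([kind, name_id, value_id, 0, 0])  # 0 means "none"
        if self.open_elements:
            parent = self.open_elements[-1]
            prev = self.last_child.get(parent)
            if prev is None:
                self.nodes[parent][4] = index  # back-annotate first child
            else:
                self.nodes[prev][3] = index    # link as next sibling
            self.last_child[parent] = index
        return index

    def startElement(self, name, attrs):
        element = self._add("E", self._intern(name), None)
        self.open_elements.append(element)
        # Attributes are stored as children of their element.
        for attr_name in attrs.getNames():
            self._add("A", self._intern(attr_name),
                      self._intern(attrs.getValue(attr_name)))

    def characters(self, content):
        # Assumption: whitespace-only text is skipped. A parser may also
        # deliver one text run in several chunks, each stored separately here.
        if content.strip():
            self._add("T", None, self._intern(content))

    def endElement(self, name):
        self.open_elements.pop()


def build(xml_text):
    builder = CompactTreeBuilder()
    xml.sax.parseString(xml_text.encode("utf-8"), builder)
    return builder
```

For example, `build('<a x="1"><b>hi</b><b>hi</b></a>')` yields six index rows while storing the repeated tag name and text value only once each.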

FIG. 1C is a flowchart 1C00 for generating a report based on a compact tree node representation of an XML document. As an option, the present flowchart 1C00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the flowchart 1C00 or any aspect therein may be implemented in any desired environment.

The flow of flowchart 1C00 commences by accessing an XML document from a database (see operation 1C10) and the XML document is traversed using database retrieval techniques in order to process the XML document (see operation 1C20). An XML tree index can be constructed node-by-node as the XML document is traversed, and during the course of node-by-node construction, the XML tree index is populated with (at least) the data needed to uniquely identify a particular node. In exemplary cases, a node has child nodes, and the index(es) of the child nodes can be back-annotated into the parent node (see operation 1C30). The flow of flowchart 1C00 allocates a hash table, and stores text portions of the XML document into hash bins within the hash table (see operation 1C40). At any point in time, some (or all) of the index and/or some (or all) of the hash table can be accessed by a report generator (see operation 1C50) and a report can be output in any known form (e.g., stored to a database, passed to a print queue, etc.).

FIG. 2A depicts an enterprise software application within a system 2A00 using a DOM XML tree model class for processing an XML document. As an option, the present system 2A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the system 2A00 or any aspect therein may be implemented in any desired environment.

As shown, an XML tree structure 100 is stored in persistent storage device 102 ₁, and is accessed by an enterprise software application 205 using a DOM XML tree model class 204. As earlier described, the legacy DOM XML tree model suffers from main memory size/usage requirements that are directly proportional to the size of the XML tree structure 100. That is, a disk-resident instance of an XML tree structure 100 having elements {A, B, C, D, E, F, G and H} is brought into memory 103 via the DOM (as shown) in the form of a memory-resident instance of an XML tree structure 100 having elements {A, B, C, D, E, F, G and H}. As the size of the XML tree structure 100 grows, so do the memory requirements.

The DOM memory-requirement problem (and other DOM problems) can be ameliorated by using an index or other advantageously-defined structure. Some embodiments of this disclosure implement techniques for creating and using an XML tree index data structure.

FIG. 2B depicts a transformation 2B00 into an index for implementing a compact tree node representation of an XML document. As an option, the present transformation 2B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the transformation 2B00 or any aspect therein may be implemented in any desired environment.

As shown, an XML tree structure 100 is transformed into an index. The XML tree structure 100 comprises elements {A (node 109 ₀), B (node 109 ₁), C (node 109 ₂), D (node 109 ₃), E (node 109 ₄), F (node 109 ₅), G (node 109 ₆), and H (node 109 ₇)}. The transformation 2B00 results in an XML tree index data structure 202 ₁, which XML tree index data structure comprises an array of rows wherein each row holds a node identifier. For example, the node 109 ₀ can be held in a row having the node identifier “A”. Of course, it is known in the art that an element node in a well-formed XML document can comprise a wide range of identifiers and still remain in accordance with the syntax of a well-formed XML document.

In addition to the node identifier, a row can comprise a pointer or other reference to a sibling node or a child node. In some cases, and as shown, a node (e.g., node “A”) might have multiple children, and the row can comprise multiple pointers or other references to multiple child nodes.

In some cases, the XML tree index data structure 202 ₁ is constructed by a depth-first traversal of the given instance of XML tree structure 100; however, it can also be constructed by a breadth-first traversal of the given instance of XML tree structure 100. The selection of a depth-first traversal or a breadth-first traversal can be made on the basis of the intended use model.

The relationships between nodes 109 serve to define the XML tree structure. In this embodiment, element nodes are the only nodes that have children. The children of an element are defined as its first child plus all other nodes that can be reached by transitive next-sibling relationships from that first child. Each element node has a name, sometimes referred to as a “tag name”. In XML strings, the tag names enclose all strings generated from their child nodes. An example XML string for an element tag name is <tag123>child values go here</tag123>. An XML tree structure using XML strings can be constructed using known techniques.

FIG. 2C depicts an enterprise software application in a system 2C00 for implementing a compact tree node representation of an XML document. As an option, the present system 2C00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the system 2C00 or any aspect therein may be implemented in any desired environment.

As shown, the system 2C00 implements a computer method for creating a compact tree node representation of an XML document. The method can be implemented within an enterprise software application 205 using a compact XML tree model class 206. The compact XML tree model class 206 reads from the persistent storage device 102 ₃ and produces the XML tree index data structure 202 ₂ and a hash table 140. More specifically, the compact XML tree model class 206 begins by allocating a first portion of the memory to store an XML tree index data structure, then traversing the XML document from a first node (e.g., “A”) to a final node (e.g., “H”), while processing the intermediate nodes (e.g., B, C, D, E, F, G). While processing the traversed nodes, the processing performs tests/checks and takes certain processing steps, as follows:

-   When the traversed node is an element node, then adding the element node to the XML tree index data structure 202 ₂; then
-   Copying the element node values (if any) to the hash table.

The XML tree index data structure 202 ₂ can be implemented in an array, possibly a sparsely populated array, or it can be implemented using a linked list (or other convenient data structure). A given node can have children, and such a given node can point to its one or more child nodes via a pointer in the array (as shown).

FIG. 3A is a diagram 3A00 of a tree node representation of an XML document. As an option, the present diagram 3A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the diagram 3A00 or any aspect therein may be implemented in any desired environment.

The foregoing XML tree structures are merely illustrative for the purposes of the present disclosure, and more detailed and larger XML tree structures are reasonable and envisioned. For example, the depiction of FIG. 3A gives an XML tree structure having a larger number of element nodes 110 (e.g., element node 110 ₁, element node 110 ₂, element node 110 ₃, etc.), as well as some text nodes 111 (e.g., text node 111 ₁, text node 111 ₂), and some attribute nodes 112 (e.g., attribute node 112 ₁, attribute node 112 ₂, etc.).

Also shown are relationships between nodes, namely sibling relationships and child relationships. In this embodiment, each attribute node 112 is associated with an element and is stored as a child of that element. An element can have any number of attributes. An attribute consists of a name value and a text value. In XML strings, the attributes appear to the right of the element's tag name. For example, <tag123 attribute_name_1=“attr_value_1” attr_name_2=“attr_val_2”></tag123>.

The text nodes store the text values that are associated with an element. In XML strings, they are shown enclosed by open and close tags bearing the tag's name. For example, <tag123>text value from text node</tag123>. An element can have any number of text nodes; however, when represented as an XML string, all text nodes are concatenated into one text value for that element.

For all node types, the names and text values can be stored in a hash table. One benefit of using a hash table is that any duplicate name values and/or duplicate text values need be stored only once in the hash table; duplicates encountered later do not require additional memory. All text values in the hash table can be retrieved using an index and an offset (see examples in FIG. 1A, FIG. 4A and FIG. 4B). The offset can be zero.
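
By way of illustration only, the storing of duplicate values once, with retrieval by a bin index and an offset, might be sketched in Python as follows. The bin count, the use of `zlib.crc32` as the hash function, and the treatment of the offset as a position within a per-bin list (rather than a byte offset) are all assumptions of this sketch.

```python
import zlib

NUM_BINS = 8  # hypothetical; kept small for illustration

bins = [[] for _ in range(NUM_BINS)]  # each bin holds the values hashed to it


def store(text):
    """Return (bin_index, offset), adding text only if not already stored."""
    bin_index = zlib.crc32(text.encode("utf-8")) % NUM_BINS
    entries = bins[bin_index]
    for offset, existing in enumerate(entries):
        if existing == text:
            return bin_index, offset  # duplicate: reuse the stored copy
    entries.append(text)
    return bin_index, len(entries) - 1


def fetch(ref):
    """Retrieve a stored value from its (bin_index, offset) reference."""
    bin_index, offset = ref
    return bins[bin_index][offset]


first = store("invoice")
second = store("invoice")  # same reference as first; nothing new is stored
other = store("total")
```

After these three calls only two values are held across all bins, and `first` and `second` are the same reference.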

FIG. 3B shows various components of XML tree index structures 3B00 for implementing compact tree node representation of an XML document. As an option, the present XML tree index structures 3B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the XML tree index structures 3B00 or any aspect therein may be implemented in any desired environment.

FIG. 3B is a tabular representation of an XML tree index data structure 3B00. More particularly, the XML tree index data structure contains node identifiers in the tabular format of the XML tree element node array 3B04. As earlier indicated, names and values (e.g., node names and text node values) can be stored in hash tables, and the XML tree index data structure 202 ₃ and the XML tree index data structure 202 ₄ are examples of such a storage regime. In one embodiment following this regime, a first part of the XML tree index data structure 3B00 stores only indexes, and the second part stores node names and indexes. This organization allows for a compact index (e.g., completely in-memory index) to represent an extremely large XML document.

As shown, the nodes in the XML tree are represented by an index value or hash value, which points to a location of a structure in one of the arrays. Each type of node in the XML tree can be organized in a structure applicable to the particular node type. Some such organizations of node representations are shown below, following the example of grouping 3B02:

Element node (e.g., “E4”)

-   Hash table index for element name (e.g., “#E4tag”);
-   XML tree index for next sibling (e.g., “E5index”);
-   XML tree index for first child (e.g., “T4index”).

Text node (e.g., “T4”)

-   Text string (e.g., “#T4string”);
-   Next Sibling (e.g., “A4index”).

Attribute node (e.g., “A4”)

-   Hash table index for attribute value (e.g., “A4value”);
-   Hash table index for attribute name (e.g., “A4name”);
-   Index of next sibling (e.g., “E7index”).

Other data values can be stored in similar structures such as:

-   Hash table index for attribute value;
-   XML tree index for next sibling;
-   Hash table index for text string;
-   XML tree index for next sibling.

Now, following the organization in this embodiment, it can be seen that it is possible that only the XML tree element node array 3B04 and the corresponding XML tree index data structure (e.g., XML tree index data structure 202 ₃) need be brought into the computer's main memory, and yet comprise a compact representation of every node in the entire XML tree. As shown, the XML tree index data structure 202 ₃ comprises hash table indexes (e.g., hash table indexes 3B06 ₁) so as to associate a given node (e.g., “E4”) with a particular hash table index (e.g., “#E4tag”). The data of the nodes can be stored in memory blocks, and be retrieved on demand (e.g., by paging to/from disk storage).

The data of the nodes can be stored in an organization to support fast access to a given node. As shown, the data structure forming the XML tree text node array 3B08 is organized in conjunction with the XML tree index data structure 202 ₄ so as to associate a given text node with its respective hash table index (e.g., see hash table indexes 3B06 ₂), and the XML tree attribute node array 3B10 is organized so as to associate a given attribute node (e.g., “A4”) with a hash table index (e.g., hash table index 3B06 ₃) for the attribute value, and a hash table index (e.g., hash table index 3B06 ₄) for the attribute name. Conveniently, a next sibling index is given in the XML tree index data structure 202 ₅.

As shown, the root node has an index of “Root” (the root node is by convention the first node in the arrays, and thus occupies index zero). It follows that no XML node can include the root node as a sibling or as a child, thus the value zero can be used when a node does not have a next sibling. The index value zero can also be used for the first child of elements that do not have children.

The children of an element can be stored for linked access, which access starts with the “first child” node of the element, and then traverses through next sibling nodes until a node without a next sibling is reached. The attributes of an element are stored in its list of children. The order of the attributes within the child linked list can be significant; however, in some cases the position of the attributes relative to the elements or text nodes in the linked list has no significance. Likewise, the order of the elements and text nodes within the child linked list can be significant; however, in some cases the order of elements and text nodes has no significance, and they can be interspersed in any order. Alternatively, any order dependence of text nodes in the same child list can be maintained (e.g., by virtue of the order in which those text values are concatenated in storage).
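
By way of illustration only, the linked access to the children of an element, using zero to mean “no next sibling” or “no first child”, might be sketched in Python as follows. The row layout and the node labels are hypothetical and chosen only for this sketch.

```python
NONE = 0  # index zero is the root, so it can double as "no sibling / no child"

# Hypothetical rows, addressed by position: (label, next_sibling, first_child)
nodes = [
    ("Root", 0, 1),  # index 0: root element, first child is node 1
    ("A1", 2, 0),    # index 1: attribute, next sibling is node 2
    ("T1", 3, 0),    # index 2: text node, next sibling is node 3
    ("E2", 0, 0),    # index 3: element with no sibling and no children
]


def children(index):
    """Yield child indexes: start at first child, follow next-sibling links."""
    child = nodes[index][2]
    while child != NONE:
        yield child
        child = nodes[child][1]


root_children = [nodes[c][0] for c in children(0)]
```

In this sketch the attribute, text node, and element all appear in the same child list of the root, in stored order, and a childless element yields nothing.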

In this specific embodiment, the stored values can be represented as follows:

A hash table index entry is represented in 32 bits, organized as:

-   16 bits for storing the offset of the text string within the hash bin;
-   16 bits for storing the index of the hash bin within the hash table.

An XML tree index entry is represented in 32 bits, organized as:

-   4 bits for storing the type of XML tree node (e.g., 0=element, 1=text node, 2=attribute, etc.);
-   12 bits for storing an index into an array of memory blocks of the corresponding node type;
-   16 bits for storing an index into an array of node structures in the memory block.
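
By way of illustration only, the two 32-bit entries described above might be packed and unpacked in Python as follows. The field widths and type codes are taken from the text; the placement of the first-listed field in the most significant bits, and the function names, are assumptions of this sketch.

```python
ELEMENT, TEXT, ATTRIBUTE = 0, 1, 2  # type codes given in the text


def pack_hash_index(bin_index, offset):
    """16-bit offset (assumed high half) plus 16-bit hash bin index."""
    assert 0 <= bin_index < (1 << 16) and 0 <= offset < (1 << 16)
    return (offset << 16) | bin_index


def unpack_hash_index(value):
    """Return (bin_index, offset) from a packed hash table index entry."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


def pack_tree_index(node_type, block, slot):
    """4-bit node type, 12-bit memory block index, 16-bit slot in the block."""
    assert 0 <= node_type < (1 << 4)
    assert 0 <= block < (1 << 12)
    assert 0 <= slot < (1 << 16)
    return (node_type << 28) | (block << 16) | slot


def unpack_tree_index(value):
    """Return (node_type, block, slot) from a packed XML tree index entry."""
    return (value >> 28) & 0xF, (value >> 16) & 0xFFF, value & 0xFFFF


example_tree_index = pack_tree_index(ATTRIBUTE, 3, 5)
example_hash_index = pack_hash_index(7, 12)
```

Under this assumed layout, an element node held in block 0, slot 0 packs to the integer 0.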

Still further, regarding this specific embodiment, the stored values can be represented as follows:

Each element node consists of 12 bytes, 4 bytes each for:

-   Hash table index for element name (tag name);
-   XML tree index for next sibling;
-   XML tree index for first child.

Each attribute node consists of 12 bytes, 4 bytes each for:

-   Hash table index for attribute name;
-   Hash table index for attribute value;
-   XML tree index for next sibling.

Each text node consists of 8 bytes, 4 bytes each for:

-   -   Hash table index for text string;    -   XML tree index for next sibling.

As depicted, a node is represented compactly, using only hash tableindexes and XML tree indexes, thus a representation of the entire XMLtree, comprising all constituent nodes, can be compactly represented andbrought into computer's main memory at one time.
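As a minimal sketch of the node sizes given above, the three record layouts can be written as fixed-width binary formats. The field order follows the lists above; the little-endian byte order and the helper names are assumptions made for illustration only.

```python
import struct

# Each field is an unsigned 32-bit value ("I"); "<" selects little-endian
# byte order with no padding, which is an assumption of this sketch.
ELEMENT_FMT = "<III"    # name hash index, next sibling index, first child index
ATTRIBUTE_FMT = "<III"  # name hash index, value hash index, next sibling index
TEXT_FMT = "<II"        # text hash index, next sibling index


def pack_node(fmt, *fields):
    """Serialize one node record into its fixed-size binary form."""
    return struct.pack(fmt, *fields)


def unpack_node(fmt, data):
    """Recover the tuple of 32-bit fields from a serialized node record."""
    return struct.unpack(fmt, data)
```

Because every record of a given type has the same size, a node can be located within a memory block by multiplying its array offset by the record size.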

FIG. 4A is a schematic 4A00 of a component of an XML tree index structure. As an option, the present schematic 4A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the schematic 4A00 or any aspect therein may be implemented in any desired environment.

As indicated in the schematic 4A00, tree node arrays are implemented using memory blocks (e.g., MB1, MB2, etc.). This technique allows memory to be allocated on demand for each type of tree node. Further, the use of memory blocks allows the total array sizes to be increased on demand to the full extent required for representation of the XML tree structure 100, while the disclosed use of memory blocks as shown in the depicted XML tree index data structures (XML tree index data structure 202 ₅, XML tree index data structure 202 ₆, XML tree index data structure 202 ₇, etc.) eliminates the need for the memory blocks to be allocated in consecutive memory locations.

As shown here, a given node can be represented in an array corresponding to its type (e.g., element node, text node, attribute node, etc.) and each node has its associated location within a memory block. Direct access to a node is accomplished using a memory block indicator (e.g., MB1, MB2, etc.) and an array offset (e.g., E1offset, T1offset, etc.). Of course, the schematic 4A00 depicts merely one technique for indexing nodes using an associated memory block locator. As shown, the embodiment uses the corresponding memory block indicator in the memory block index column and the associated array offset given in the array offset column.

Following this hierarchical approach, the use of memory blocks allows the total array sizes to be increased on demand to the full extent required for representation of the XML tree structure 100, while the disclosed use of memory blocks eliminates the need to allocate consecutive memory blocks.
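The on-demand growth described above can be illustrated with a node array made of separately allocated blocks, where each stored record is addressed by a memory block index and an array offset. This is a sketch only: the class name, the method names, and the use of Python lists as stand-ins for memory blocks are assumptions.

```python
class NodeArray:
    """Growable node array built from fixed-capacity memory blocks."""

    def __init__(self, block_capacity=65536):
        self.block_capacity = block_capacity
        self.blocks = []  # each block is allocated separately, on demand

    def append(self, record):
        """Store a record; return its (memory block index, array offset)."""
        if not self.blocks or len(self.blocks[-1]) == self.block_capacity:
            self.blocks.append([])  # allocate the next block only when needed
        block = self.blocks[-1]
        block.append(record)
        return len(self.blocks) - 1, len(block) - 1

    def get(self, block_index, offset):
        """Direct access using the memory block index and the array offset."""
        return self.blocks[block_index][offset]
```

Growing the array never moves existing records, so previously issued (memory block index, array offset) pairs remain valid.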

FIG. 4B is an array-oriented schematic 4B00 of an XML tree index structure used in a compact tree node representation of an XML document. As an option, the present schematic 4B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the schematic 4B00 or any aspect therein may be implemented in any desired environment.

Strictly as one embodiment of the shown XML tree index data structure 202 ₈, the bit-wise sizes of stored data items can be:

Node index within an XML tree index=32 bits in bit fields as follows:

-   Type=4 bits
    -   Example: 0=element node, 1=text node, 2=attribute node
-   Memory Block Index=12 bits
-   Array Offset=16 bits

In this embodiment, the hash table consists of 65,536 hash bins. Given any fixed size of a hash table, it can happen that more than one text string is mapped (via its hash value) to the same hash bin. When more than one text string maps to the same bin, the strings can be delimited by a left angle bracket “<”, which is a character that is not allowed within the XML names or text values. As an alternative, text strings can be converted to UTF8 (or otherwise formatted) strings before they are added to the hash table. In order to address more than one string in a hash bin, an offset number is used to indicate which string within that hash bin is to be addressed. Within each hash table bin string, individual text strings can be located and enumerated by counting offsets, or by counting the number of “<” characters from the beginning of the string. New text values can be added by appending them to the end of their hash bin string, adding a terminating delimiter, and using the corresponding array offset.
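Strictly as an example, a hash bin string with “<” delimiters can be sketched as follows. Here the offset is taken to be the count of delimiters that precede a string, which is one of the two enumeration options named above; the function names are hypothetical.

```python
DELIM = "<"  # delimiter named in the text; must not occur inside stored values


def append_to_bin(bin_string, text):
    """Append text plus a terminating delimiter; return (new bin string, offset)."""
    if DELIM in text:
        raise ValueError("text must not contain the delimiter")
    offset = bin_string.count(DELIM)  # number of strings already in the bin
    return bin_string + text + DELIM, offset


def string_at(bin_string, offset):
    """Locate a stored string by counting delimiters from the start of the bin."""
    strings = bin_string.split(DELIM)[:-1]  # drop the empty piece after the last "<"
    return strings[offset]
```

For instance, appending “title” and then “author” to an empty bin yields the bin string `title<author<`, with offsets 0 and 1 respectively.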

FIG. 5 is a flowchart 500 for using a system having a compact tree node representation of an XML document. As an option, the present flowchart 500 may be implemented in the context of the architecture and functionality of the embodiments described herein. Also, the flowchart 500 or any aspect therein may be implemented in any desired environment.

Exemplary uses for embodiments of this disclosure include storing XML data, where that data has been generated (e.g., by a report generator) and the generated structures are used and reused. Another exemplary use would be when the XML text data never changes, or rarely changes. The compact XML tree node data structure can be easily changed by changing the index values in the node arrays. The compact XML tree node data structure is designed for storing very large XML structures using minimal memory space.

Accordingly, a system might follow the flow of FIG. 5 in order to determine when to implement a compact XML tree node data structure, or when to use other techniques (e.g., DOM). As shown, a system that implements operations of the flowchart 500 commences by analyzing characteristics of a use model (see operation 510), and on the basis of that analysis, especially after determining if the XML data is rapidly changing or not (see decision 512), the system will move to use the DOM model (see operation 550) or, otherwise, to use techniques as disclosed herein, namely to create a compact XML tree node data structure (see operation 530) and access that compact XML tree node data structure 170 using the heretofore disclosed data structures (see operation 540).

Many computer systems can implement the disclosed methods for creating a compact tree node representation of an XML document. For example, for creating instances of the XML tree index structures 3B00, the method can commence upon allocating a first portion of memory to store a first memory block for storing an XML tree index data structure, and then allocating a second portion of the memory to store the hash table. Then, an XML document (e.g., an XML tree structure 100) can be transformed into an XML tree index data structure by traversing the XML document from a first node to a final node while processing the nodes (including any intermediate nodes) as follows:

-   When the traversed node is an element node, then adding the element node to the XML tree index data structure (possibly also placing the element node name in the hash table);
-   When the traversed node is a text node, then populating a text node index into the XML tree index data structure and copying the text node values to the hash table;
-   When the traversed node is an attribute node, then populating an attribute node index into the XML tree index data structure and copying the attribute node values to the hash table;
-   Determining when the first memory block is fully populated (or is known or calculated to overflow upon an attempted copy-in of a node), then storing the first memory block to the persistent storage device and then allocating a next portion of memory.

The foregoing techniques serve to create and store large trees, yet access to the resulting trees can be done with only needed portions resident in the computer's main memory.
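The traversal steps above can be sketched using the Python standard library parser. This is a simplified illustration under stated assumptions: the `CompactTree` class and `build` function are hypothetical, a plain list stands in for the hash table, whitespace-only text is skipped, and the storing of full memory blocks to persistent storage is omitted.

```python
import xml.etree.ElementTree as ET

ELEMENT, TEXT, ATTRIBUTE = 0, 1, 2  # node type codes from the embodiment above


class CompactTree:
    def __init__(self):
        self.nodes = {ELEMENT: [], TEXT: [], ATTRIBUTE: []}  # one array per type
        self.values = []  # simplified stand-in for the hash table

    def store(self, text):
        """Copy a name or value into the value store; return its index."""
        self.values.append(text)
        return len(self.values) - 1

    def add(self, node_type, record):
        """Append a node record; return its index as (type, position)."""
        self.nodes[node_type].append(record)
        return node_type, len(self.nodes[node_type]) - 1

    def node(self, ref):
        return self.nodes[ref[0]][ref[1]]


def build(tree, elem):
    """Add elem and its subtree to tree; return the index of elem."""
    ref = tree.add(ELEMENT, {"name": tree.store(elem.tag), "next": None, "first": None})
    children = []
    for name, value in elem.attrib.items():
        record = {"name": tree.store(name), "value": tree.store(value), "next": None}
        children.append(tree.add(ATTRIBUTE, record))
    if elem.text and elem.text.strip():
        children.append(tree.add(TEXT, {"text": tree.store(elem.text), "next": None}))
    for child in elem:
        children.append(build(tree, child))
        if child.tail and child.tail.strip():
            children.append(tree.add(TEXT, {"text": tree.store(child.tail), "next": None}))
    # Chain the children through next-sibling indexes, then set the first child.
    for current, following in zip(children, children[1:]):
        tree.node(current)["next"] = following
    if children:
        tree.node(ref)["first"] = children[0]
    return ref
```

In this sketch the attributes are placed at the head of the child list; as noted earlier, other orderings are possible.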

Still other improvements are possible. For example, it is possible to determine when the intermediate node is a text node that has already been copied to the hash table, and then to refrain from copying the text node values to the hash table again, instead merely using the hash value that points to the same text node. This avoids unnecessary duplication of text nodes.

It is also convenient to determine when a hashed text node to be inserted has the same hash bin as a previously hashed text node, and then to store the text node at an offset within the same hash bin. This allows for hash table collisions, and handles such a collision gracefully.

Of course, the data of a text node might be conveniently stored using a particular text encoding scheme. For example, some embodiments convert the data of a text node into UTF8 strings.
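The three refinements above (reuse of an already stored value, placing colliding values at offsets within one bin, and UTF8 storage) can be combined in one small sketch. The `TextTable` class is hypothetical, and the use of CRC-32 as the hash function is an assumption; the disclosure does not name a particular hash function.

```python
import zlib

DELIM = b"<"  # the "<" byte never occurs inside a multi-byte UTF-8 sequence


class TextTable:
    def __init__(self, num_bins=65536):
        self.bins = [b""] * num_bins  # each bin is one delimited byte string

    def intern(self, text):
        """Store text once; return its (bin index, offset within the bin)."""
        data = text.encode("utf-8")
        if DELIM in data:
            raise ValueError("text must not contain the delimiter")
        bin_index = zlib.crc32(data) % len(self.bins)  # assumed hash function
        stored = self.bins[bin_index].split(DELIM)[:-1]
        if data in stored:
            return bin_index, stored.index(data)  # reuse: no duplicate copy
        self.bins[bin_index] += data + DELIM      # collision: next offset in bin
        return bin_index, len(stored)

    def lookup(self, bin_index, offset):
        """Return the text stored at the given bin index and offset."""
        return self.bins[bin_index].split(DELIM)[:-1][offset].decode("utf-8")
```

Two nodes that carry the same text receive the same (bin index, offset) pair, so the text is held only once.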

Dividing the XML document into an index portion and a hash table portion supports effective use of memory by paging-in the node data only as needed (and paging-out when the node data is no longer needed). In a computer implemented embodiment, one method for reading a compact tree node representation of an XML document comprises reading an entire XML tree index data structure into a memory, and then processing the XML tree index data structure, by accessing the node data only when needed. In some embodiments, the processing includes determining when the node is a text node, then paging into memory a text node value from a hash table. Similarly, determining when the node is an attribute node, then paging into the memory an attribute node value from a hash table, and so on.
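As a rough illustration of this reading method, the sketch below keeps the whole index in memory and fetches node values from a backing store only when a text node is encountered. The `PagedValues` class, the dictionary used as a stand-in for a hash table on persistent storage, and the simplified index of (node type, value key) pairs are all assumptions.

```python
ELEMENT, TEXT, ATTRIBUTE = 0, 1, 2


class PagedValues:
    """Node values held outside main memory and paged in on demand."""

    def __init__(self, backing_store):
        self._store = backing_store  # stand-in for a hash table on disk
        self._resident = {}          # values currently paged into memory
        self.page_ins = 0

    def get(self, key):
        if key not in self._resident:
            self._resident[key] = self._store[key]  # page the value in
            self.page_ins += 1
        return self._resident[key]

    def page_out(self, key):
        """Release a value that is no longer needed."""
        self._resident.pop(key, None)


def collect_text(index, values):
    """Walk the in-memory index; page in values only for text nodes."""
    return [values.get(key) for node_type, key in index if node_type == TEXT]
```

Element names and attribute values that are never requested are never paged in, which is the memory saving the paragraph above describes.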

Other Features

Modifying relationships within the compact XML tree node data structure 170 is fast since the relationships are implemented using indexes and hash codes. In exemplary uses, such a structure is much faster as compared to DOM where the relationships involve C++ classes and data pointers.

Another aspect of the compact XML tree node data structure 170 is that certain types of editing can be expressly enabled, in some embodiments. Certain editing conveniences arise from the reuse of the text values in the hash table. Since many text usages may point to a single text string in the hash table, a particular use of a text value can be easily created by adding it as a new value that is pointed to by the particular use.

Editing the tree structure can be accomplished by merely changing the index values. Editing the text values, however, can be accomplished using a technique to add a new entry into the text hash table. The nodes that should remain pointing to the unedited text values remain unaltered, as do the unedited values in the hash table. It is also possible that an edited text value (e.g., one that is added as a new entry into the text hash table) replaces the last/only reference to the unedited text value; thus it follows that there can be some unused values in the hash table. Accordingly, some embodiments step through the hash table to identify unused text values, and possibly compact the storage areas used for storing the identified unused text values.
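The text editing technique above can be sketched as follows: an edit adds (or reuses) an entry and repoints only the edited node, and a later pass identifies values that no node references. The function names and the list used as a stand-in for the text hash table are assumptions; compaction of the unused values is not shown.

```python
def edit_text(nodes, strings, node_id, new_text):
    """Point one text node at new_text, leaving other users of the old value alone."""
    if new_text in strings:
        new_id = strings.index(new_text)  # reuse an existing entry
    else:
        strings.append(new_text)          # add a new entry; old entry is untouched
        new_id = len(strings) - 1
    nodes[node_id]["text"] = new_id       # the edit itself is an index change


def unused_values(nodes, strings):
    """Step through the table and report entries that no node references."""
    used = {node["text"] for node in nodes}
    return [i for i in range(len(strings)) if i not in used]
```

For example, if two text nodes share the value “draft” and only one is edited to “final”, the entry for “draft” stays in use; it becomes unused only once the second node is edited as well.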

Additional Embodiments of the Disclosure

FIG. 6 depicts a block diagram of a system to perform certain functions of a computer system. As an option, the present system 600 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 600 or any operation therein may be carried out in any desired environment.

As shown, system 600 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 605, and any operation can communicate with other operations over communication path 605. The modules of the system can, individually or in combination, perform method operations within system 600. Any operations performed within system 600 may be performed in any order unless as may be specified in the claims. The embodiment of FIG. 6 implements a portion of a computer system, shown as system 600, comprising a computer processor to execute a set of program code instructions (see module 610) and modules for accessing memory to hold program code instructions to perform: allocating a first portion of the memory to store a first memory block for storing an XML tree index data structure (see module 620); allocating a second portion of the memory to store at least a portion of the hash table (see module 630); traversing an XML document from a first node to a final node and through at least one intermediate node (see module 640); processing the traversed nodes, the processing comprising conditional operations (see module 650); performing such conditional operations such as, when the traversed node is an element node, then adding the element node to the XML tree index data structure; when the traversed node is a text node, then populating a text node index into the XML tree index data structure and copying the text node values to the hash table, the copied text node values accessible using the text node index; and when the traversed node is an attribute node, then populating an attribute node index into the XML tree index data structure and copying the attribute node values to the hash table, the copied attribute node values accessible using the attribute node index (see module 660); and determining when the first memory block is fully populated, then storing the first memory block to the persistent storage device and allocating a next portion of memory (see module 670).

System Architecture Overview

FIG. 7 depicts a block diagram of an instance of a computer system 700 suitable for implementing an embodiment of the present disclosure. Computer system 700 includes a bus 706 or other communication mechanism for communicating information, which interconnects subsystems and devices, such as a processor 707, a system memory 708 (e.g., RAM), a static storage device (e.g., ROM 709), a disk drive 710 (e.g., magnetic or optical), a data interface 733, a communication interface 714 (e.g., modem or Ethernet card), a display 711 (e.g., CRT or LCD), input devices 712 (e.g., keyboard, cursor control), and an external data repository 731.

According to one embodiment of the disclosure, computer system 700 performs specific operations by processor 707 executing one or more sequences of one or more instructions contained in system memory 708. Such instructions may be read into system memory 708 from another computer readable/usable medium, such as a static storage device or a disk drive 710. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software. In one embodiment, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.

The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to processor 707 for execution. Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as disk drive 710. Volatile media includes dynamic memory, such as system memory 708.

Common forms of computer readable media include, for example, floppy disk, flexible disk, hard disk, magnetic tape, or any other magnetic medium; CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; RAM, PROM, EPROM, FLASH-EPROM, or any other memory chip or cartridge, or any other non-transitory medium from which a computer can read data.

In an embodiment of the disclosure, execution of the sequences of instructions to practice the disclosure is performed by a single instance of the computer system 700. According to other embodiments of the disclosure, two or more computer systems 700 coupled by a communication link 715 (e.g., LAN, PSTN, or wireless network) may perform the sequence of instructions required to practice the disclosure in coordination with one another.

Computer system 700 may transmit and receive messages, data, and instructions, including programs (e.g., application code), through communication link 715 and communication interface 714. Received program code may be executed by processor 707 as it is received, and/or stored in disk drive 710 or other non-volatile storage for later execution. Computer system 700 may communicate through a data interface 733 to a database 732 on an external data repository 731. A module as used herein can be implemented using any mix of any portions of the system memory 708, and any extent of hard-wired circuitry including hard-wired circuitry embodied as a processor 707.

In the foregoing specification, the disclosure has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than restrictive sense.

What is claimed is:
 1. A computer implemented method for reading, into a memory, a compact tree node representation of an XML document having a tree index data structure and separate node data, the method comprising: allocating a first portion of the memory to store an entire XML tree index data structure; allocating a second portion of the memory, the second portion of the memory separate from the first portion of the memory; reading the entire XML tree index data structure into the first portion of the memory, wherein the XML tree index data structure comprises the structure of the XML tree and does not comprise the separate node data; and processing an XML tree index entry, the processing comprising: accessing the XML tree index entry to retrieve a reference to at least one node from the separate node data; and loading into the second portion of the memory, the at least one node from the separate node data.
 2. The computer implemented method of claim 1, further comprising processing an XML tree index entry, the further processing comprising: when the reference to the at least one node refers to a text node, then paging into the second portion of the memory a text node value from the separate node data; when the reference to the at least one node refers to an attribute node, then paging into the second portion of the memory an attribute node value from the separate node data; and when the reference to the at least one node refers to an element node, then paging into the second portion of the memory an element node value from the separate node data.
 3. The computer implemented method of claim 2, further comprising processing an XML tree index entry, the further processing comprising: paging out of the second portion of the memory the text node value.
 4. The computer implemented method of claim 1, wherein the XML tree index data structure is organized as an array.
 5. The computer implemented method of claim 1, wherein the separate node data structure is organized as at least one hash table.
 6. The computer implemented method of claim 1, wherein the XML tree index data structure comprises an XML tree index for first child.
 7. The computer implemented method of claim 1, wherein the XML tree index data structure comprises an XML tree index for next sibling.
 8. A computer system for reading, into a memory, a compact tree node representation of an XML document having a tree index data structure and separate node data, the computer system comprising: a computer processor to execute a set of program code instructions; and a memory to hold the program code instructions, in which the program code instructions comprises program code to perform, allocating a first portion of the memory to store an entire XML tree index data structure; allocating a second portion of the memory, the second portion of the memory separate from the first portion of the memory; reading the entire XML tree index data structure into the first portion of the memory, wherein the XML tree index data structure comprises the structure of the XML tree and does not comprise the separate node data; and processing an XML tree index entry, the processing comprising: accessing the XML tree index entry to retrieve a reference to at least one node from the separate node data; and loading into the second portion of the memory, the at least one node from the separate node data.
 9. The computer system of claim 8, further comprising processing an XML tree index entry, the further processing comprising: when the reference to the at least one node refers to a text node, then paging into the second portion of the memory a text node value from the separate node data; when the reference to the at least one node refers to an attribute node, then paging into the second portion of the memory an attribute node value from the separate node data; and when the reference to the at least one node refers to an element node, then paging into the second portion of the memory an element node value from the separate node data.
 10. The computer system of claim 9, further comprising processing an XML tree index entry, the further processing comprising: paging out of the second portion of the memory the text node value.
 11. The computer system of claim 8, wherein the XML tree index data structure is organized as an array.
 12. The computer system of claim 8, wherein the separate node data structure is organized as at least one hash table.
 13. The computer system of claim 8, wherein the XML tree index data structure comprises an XML tree index for first child.
 14. The computer system of claim 8, wherein the XML tree index data structure comprises an XML tree index for next sibling.
 15. A computer program product embodied in a non-transitory computer readable medium, the computer readable medium having stored thereon a sequence of instructions which, when executed by a processor causes the processor to execute a process to read, into a memory, a compact tree node representation of an XML document having a tree index data structure and separate node data, the process comprising: allocating a first portion of the memory to store an entire XML tree index data structure; allocating a second portion of the memory, the second portion of the memory separate from the first portion of the memory; reading the entire XML tree index data structure into the first portion of the memory, wherein the XML tree index data structure comprises the structure of the XML tree and does not comprise the separate node data; and processing an XML tree index entry, the processing comprising: accessing the XML tree index entry to retrieve a reference to at least one node from the separate node data; and loading into the second portion of the memory, the at least one node from the separate node data.
 16. The computer program product of claim 15, further comprising processing an XML tree index entry, the further processing comprising: when the reference to the at least one node refers to a text node, then paging into the second portion of the memory a text node value from the separate node data; when the reference to the at least one node refers to an attribute node, then paging into the second portion of the memory an attribute node value from the separate node data; and when the reference to the at least one node refers to an element node, then paging into the second portion of the memory an element node value from the separate node data.
 17. The computer program product of claim 16, further comprising processing an XML tree index entry, the further processing comprising: paging out of the second portion of the memory the text node value.
 18. The computer program product of claim 15, wherein the XML tree index data structure is organized as an array.
 19. The computer program product of claim 15, wherein the separate node data structure is organized as at least one hash table.
 20. The computer program product of claim 15, wherein the XML tree index data structure comprises an XML tree index for first child.